ABSTRACT

Contracts are a form of lightweight formal specification em- 
bedded in the program text. Being part of the code, they 
encourage programmers to devote the proper attention to 
specifications and help maintain consistency between specifi- 
cation and implementation as the program evolves. For ver- 
ification, contracts can be evaluated at run time, and hence 
support dynamic analysis. The present study investigates 
the connection between contracts and software evolution. 
Based on an extensive empirical analysis of 15 contract- 
equipped Eiffel and C# projects totaling 94 development- 
years, it explores, among other questions: 1) how specifi- 
cations are used over time; 2) which kinds of specification 
element (preconditions, postconditions, class invariants) are 
used more often; 3) the relationship between code changes 
and specification changes; and 4) the role of inheritance in 
the process. It has found, among other results, that: the 
percentage of program elements that include specifications 
(contracts) is above 33% for most projects and tends to be 
stable over time; there is no strong preference for a certain 
type of specification element; specifications are quite sta- 
ble compared to implementations; and inheritance does not 
significantly affect qualitative trends of specification usage. 

1. INTRODUCTION 

What happens to specifications when the code changes? 
In many cases, they are not updated to reflect the latest evo- 
lution of the software they are supposed to document. This 
creates a vicious cycle where specifications become useless 
because obsolete, and are not updated because considered 
useless [37]. Techniques such as contracts [29] and literate
programming [25] try to break this vicious cycle by combin- 
ing code and specification: if the effort to maintain them 
synchronized is lower, updating the specification is not per- 
ceived as a vain effort, but a constituent part of the devel- 
opment process. 

Few people question the value of having accurate specifica-
tions; as we highlight in Section 3, various activities of devel-
opment, testing, and analysis can become more precise and
automated if specifications are available [21]. However, the
software projects that systematically deploy specifications — 
especially formal specifications — are still a minority. The 
extensive empirical study of this paper thoroughly analyzes 
a large selection of projects belonging to this group, with the 
goal of studying how specifications can be written, changed, 
and maintained as part of general software development. 

The study targets applications and libraries written in Eif- 
fel and C#, two object-oriented languages supporting con- 
tracts, which provide executable formal specifications in the 
form of pre and postconditions and class invariants — a form 
of rigorous documentation amenable to quantitative and au- 
tomated analysis. Eiffel has always supported contracts 
natively, whereas C# has added them only recently with 
the Code Contracts framework [14], but the number of C# 
projects using contracts is growing. For readers not familiar 
with contracts, Section 2 gives an essential introduction.

Overall, our study analyzed more than 210 million lines 
of code and specification distributed over 5900 revisions or 
94 years of project life. Section 4 describes the data collec-
tion process.^1 To our knowledge, this is the first extensive
study of the practical usage of simple specifications such as 
contracts and their evolution. 

The study's specific questions target various aspects of
using contracts: Is the usage of contracts quantitatively sig- 
nificant and uniform across the various projects? How does 
it evolve over time? How does it change with the overall 
project size? What kinds of contracts are used more often? 
What happens to contracts when implementations change? 
What is the role of inheritance? 

The findings of the study, described in Section 5, include:

• The analyzed projects make significant use of con-
tracts: the percentage of routines and classes with
specification is above 33% in the majority of projects.

• The percentage usage of specifications tends to be sta- 
ble over time, except for the occasional turbulent phases 
where major refactorings are performed. 

• There is no strong preference for some kinds of specifi- 
cation elements (preconditions, postconditions, class 
invariants); but preconditions, when they are used, 
tend to be larger (more clauses) than postconditions. 

• Specifications are quite stable compared to implemen- 
tations: a routine's body may change often, but its 
contracts will change infrequently. 

• Inheritance does not significantly affect the qualitative
findings about specification usage: measures including
and excluding inherited contracts tend to correlate.

^1 The complete experimental data is available online [7, 13].

These findings shed some light on how lightweight for- 
mal specifications can be used in practice, and suggest that 
reaping their benefits is feasible: it becomes "possible and
practical to produce better documentation" [37].

2. CONTRACTS AS SPECIFICATION 

Contracts [29] are the form of lightweight specification 
that we consider in this paper; therefore, we will use the 
terms "contract" and "specification" as synonyms. This sec- 
tion gives a concise overview of the semantics of contracts 
and of how they can change with the implementation. The 
presentation uses a simplified example written in pseudo-
Eiffel code (see Figure 1), which is, however, representative
of features found in real code (see Section 5).

Consider a class MEASURE used to store a sequence of
measures, each represented by an integer number; the se-
quence is stored in a list as attribute data. Figure 1 shows
two revisions of class MEASURE, formatted so as to high-
light the lines of code or specification added in revision 2.
MEASURE includes specification elements in the form of
preconditions (require), postconditions (ensure), and class
invariants (invariant). Each element includes one or more
clauses, one per line; the clauses are logically anded. For
example, routine (method) add_datum has one precondition
clause on line 8 and, in revision 2, another clause on line 9.

Contract clauses use the same syntax as Boolean expres- 
sions of the programming language; therefore, they are exe- 
cutable and can be checked at runtime. A routine's precon- 
dition must hold whenever the routine is called; the caller 
is responsible for satisfying the precondition of its callee. A 
routine's postcondition must hold whenever the routine ter- 
minates execution; the routine body is responsible for satis- 
fying the postcondition upon termination. A class invariant 
specifies the "stable" object states between consecutive rou- 
tine calls: it must hold whenever a new object of the class 
is created and after every (public) routine call terminates.^2
In Figure 1 (left), routine add_datum must be called with
an actual argument d that represents a measure not already
stored in the list data (precondition on line 8); when the
routine terminates, the list data must not be Void (class in-
variant on line 19) and not empty (postcondition on line 11).
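The run-time semantics described above can be illustrated with a minimal sketch that mimics revision 1 of class MEASURE in Python. This is not the paper's code: the names ContractViolation, check, and Measure are hypothetical, and each contract clause is rendered as an ordinary Boolean check, on the assumption that a failed clause should simply raise an exception.

```python
class ContractViolation(Exception):
    """Signals a failed precondition, postcondition, or class invariant."""


def check(condition, label):
    # A contract clause is an ordinary Boolean expression evaluated at run time.
    if not condition:
        raise ContractViolation(label)


class Measure:
    """Hypothetical Python counterpart of class MEASURE, revision 1."""

    def __init__(self):
        self.data = []          # make do create data end
        self._invariant()       # the invariant must hold on creation

    def _invariant(self):
        check(self.data is not None, "invariant: data /= Void")

    def add_datum(self, d):
        # Precondition: the caller is responsible for it.
        check(d not in self.data, "precondition: not data.has (d)")
        self.data.append(d)
        # Postcondition: the routine body is responsible for it.
        check(len(self.data) > 0, "postcondition: not data.is_empty")
        self._invariant()       # the invariant must hold after every public call

    def copy_data(self, new_list):
        check(new_list is not None, "precondition: new_list /= Void")
        for x in new_list:
            self.add_datum(x)
        self._invariant()


measure = Measure()
measure.add_datum(3)
try:
    measure.add_datum(3)        # 3 is already stored: the caller breaks the precondition
except ContractViolation as violation:
    print(violation)            # precondition: not data.has (d)
```

The second call fails in the caller's name, not the routine's: the label identifies which party to the contract is at fault, which is what makes contracts usable as automatic oracles.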

Revision 2 of class MEASURE, in Figure 1 (right), in-
troduces changes in the code and in the specification. Some
contracts become stronger: a precondition clause and a class
invariant clause are added on lines 9 and 22. Routine
copy_data changes its implementation: revision 2 checks
whether new_list is empty (line 17) before copying its el-
ements, so as to satisfy the new invariant clause on line 22.

3. WHY WE SHOULD CARE ABOUT 
CHANGING SPECIFICATIONS 

Since specifications in the form of contracts are executable, 
their changes over the development of a software project may 
directly affect the ways in which the software is developed, 
and automatically tested and analyzed. We now sketch a 
few practical examples that motivate the empirical analysis. 

^2 The semantics of class invariants is more subtle in the gen-
eral case [29], but the details are unimportant here.



Testing is a widely used verification technique based on 
executing a system to find failures that reveal errors. Testing 
requires oracles [41, 21, 19] to determine if the call to a
certain routine is valid and produces the expected result.
Since contracts can be evaluated at runtime like
any other program expression, they can serve as completely
automatic testing oracles. Previous work (to mention just a
few: [27, 6, 3, 47, 42, 49, 30, 45, 40]) has built "push button"
testing frameworks that use contracts.

The effective usage of contracts as oracles in software test-
ing [17, 23, 1] rests on some assumptions. Besides the ob-
vious requirements that contracts be available and main-
tained, how pre and postconditions change affects the testa-
bility of routines. The stronger the precondition of a routine
r is, the harder it is to test the routine, because more calls to r
become invalid. In Figure 1, add_datum is harder to test in
revision 2 because it has a stronger precondition. On the
other hand, a stronger precondition makes r's clients more
easily testable for errors, in that there is a higher chance
that a test suite will trigger a call from a client that
does not comply with r's stricter specification. This is the
case of copy_data in the example of Section 2, which calls
add_datum and hence may fail to satisfy the latter's stronger
precondition in revision 2. Conversely, a stronger postcon-
dition makes a routine itself easier to test for errors.
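To make the effect of precondition strength on testability concrete, the following sketch estimates the fraction of randomly generated calls to add_datum that are valid under each revision's precondition. It is only an illustration under stated assumptions: the input ranges, the list length, and the names pre_rev1, pre_rev2, and valid_call_ratio are hypothetical and are not taken from the testing frameworks cited above.

```python
import random


def pre_rev1(data, d):
    # Revision 1: not data.has (d)
    return d not in data


def pre_rev2(data, d):
    # Revision 2 adds the clause d > 0, so the precondition is stronger.
    return d not in data and d > 0


def valid_call_ratio(precondition, trials=10_000, seed=0):
    """Fraction of random (data, d) pairs that satisfy the precondition."""
    rng = random.Random(seed)   # same seed: both revisions see the same inputs
    valid = 0
    for _ in range(trials):
        data = [rng.randint(-5, 5) for _ in range(3)]
        d = rng.randint(-5, 5)
        if precondition(data, d):
            valid += 1
    return valid / trials


ratio_rev1 = valid_call_ratio(pre_rev1)
ratio_rev2 = valid_call_ratio(pre_rev2)
print(ratio_rev1, ratio_rev2)
```

Because both runs draw identical inputs and every call valid in revision 2 is also valid in revision 1, the second ratio can never exceed the first; a random tester therefore wastes more generated calls on the revised routine.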

Conflict analysis. Specifications in the form of con-
tracts can help detect potential indirect conflicts [4, 35] be-
tween code maintained by different developers. Syntactic
changes to the contract of a public routine may indicate con- 
flicts in its clients, if the syntactic changes reflect a changed 
routine semantics. Thus, using syntactic changes as indi- 
cators of possible indirect conflicts is workable only if con- 
tracts change much less frequently than implementations, so 
that following changes in the former generates only a limited 
number of warnings; and, conversely, only if specifications 
are consistently changed when the semantics of the imple- 
mentation changes, so as to produce few false negatives. 
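A minimal sketch of how syntactic contract changes could be turned into conflict warnings is given below. It assumes that each revision is available as a mapping from routine names to a pair of contract text and body text; this representation and the names normalize and changed_routines are hypothetical, not those of an actual conflict detection tool.

```python
def normalize(text):
    # Ignore differences that are only whitespace or line breaks.
    return " ".join(text.split())


def changed_routines(old, new):
    """Compare two revisions; each maps routine name -> (contract, body).

    Returns the routines whose contract changed syntactically and the
    routines whose body changed syntactically.
    """
    contract_changed, body_changed = [], []
    for name in sorted(old.keys() & new.keys()):
        old_contract, old_body = old[name]
        new_contract, new_body = new[name]
        if normalize(old_contract) != normalize(new_contract):
            contract_changed.append(name)   # candidate indirect conflict
        if normalize(old_body) != normalize(new_body):
            body_changed.append(name)
    return contract_changed, body_changed


rev1 = {
    "add_datum": ("require not data.has (d) ensure not data.is_empty",
                  "data.append (d)"),
    "copy_data": ("require new_list /= Void",
                  "across new_list as x: add_datum (x)"),
}
rev2 = {
    "add_datum": ("require not data.has (d) d > 0 ensure not data.is_empty",
                  "data.append (d)"),
    "copy_data": ("require new_list /= Void",
                  "if not new_list.is_empty then across new_list as x: add_datum (x)"),
}
print(changed_routines(rev1, rev2))   # (['add_datum'], ['copy_data'])
```

On the MEASURE example, only add_datum would raise a warning for its clients, whereas the body change in copy_data would not; the scheme is useful only if such warnings are rare, which is the question the study addresses.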

Object retrieval error detection. Changes in the at-
tributes of a class may affect the capability to retrieve previ-
ously stored objects [32, 9, 38]. Class invariants help detect
when inconsistent objects stored in a previous revision are
introduced in the system: they express properties of the ob-
ject state, and hence of attributes, that every valid object
must satisfy. In Figure 1, objects stored in revision 1 with
an empty data list cannot be retrieved after the code has
evolved into revision 2 because they break the new class in-
variant. Knowing whether developers consistently add in-
variant clauses for describing constraints on newly intro-
duced attributes tells us whether class invariants are reliable
to detect inconsistent objects as they are retrieved.
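The retrieval scenario can be sketched as follows, assuming (purely for illustration) that objects are stored as JSON and that the class invariant is re-checked whenever an object is loaded. The functions invariant_rev1, invariant_rev2, and retrieve are hypothetical names; the persistence mechanisms discussed in the cited work may differ.

```python
import json


def invariant_rev1(obj):
    # Revision 1: data /= Void
    return obj.get("data") is not None


def invariant_rev2(obj):
    # Revision 2 adds the clause: not data.is_empty
    return obj.get("data") is not None and len(obj["data"]) > 0


def retrieve(serialized, invariant):
    """Load a stored object and reject it if it breaks the class invariant."""
    obj = json.loads(serialized)
    if not invariant(obj):
        raise ValueError("retrieved object violates the class invariant")
    return obj


stored_in_rev1 = json.dumps({"data": []})      # valid when it was stored
print(retrieve(stored_in_rev1, invariant_rev1))
try:
    retrieve(stored_in_rev1, invariant_rev2)   # inconsistent after evolution
except ValueError as error:
    print(error)
```

The check is only as good as the invariant: an attribute introduced without a matching invariant clause would let inconsistent objects through unnoticed.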

4. STUDY SETUP 

Our study analyzes the evolution of contract specifications 
in C# and Eiffel, covering a wide range of projects of differ- 
ent sizes and life spans and including both professional and 
student developers. The following subsections present how 
we selected the projects; the tools we developed to perform 
the analysis; and the raw measures collected. 

4.1 Data selection 

We selected 15 open-source projects using contracts and
available in public repositories, whose revision histories are
accessible using Subversion or Mercurial. Table 1 lists the
projects and, for each of them, the total number of revisions,
the life span (in weeks), the size in lines of code at the latest
revision, the number of developers involved (i.e., the number
of committers to the repository), and a short description.

     1  class MEASURE                        -- Revision 1
     2
     3    make do create data end
     4
     5    data: LIST [INTEGER]               -- List of data
     6
     7    add_datum (d: INTEGER)
     8      require not data.has (d)
     9      do
    10        data.append (d)
    11      ensure not data.is_empty
    12
    13    copy_data (new_list: LIST [INTEGER])
    14      require new_list ≠ Void
    15      do
    16        across new_list as x: add_datum (x)
    17
    18    invariant
    19      data ≠ Void

Figure 1 (left): Class MEASURE in revision 1.


The 8 Eiffel projects comprise some of the largest publicly 
available Eiffel applications and libraries, such as the Eiffel- 
Base and Gobo libraries (maintained by Eiffel Software and 
GoboSoft), as well as EiffelProgramAnalysis (developed by 
students) and AutoTest (developed by our research group). 
We selected the C# projects available on the Code Con- 
tracts webpage [s], which lists all major C# open projects 
using contracts. One of them, however, contained no C# 
contracts, and another 4 had only few revisions, therefore 
we excluded them from the study as they would have been 
out of scope; the remaining 7 C# projects include 2 large ap- 
plications mainly developed by Microsoft Research (Boogie 
and Dafny). 

With the help of the project configuration files, we man- 
ually went through all project repositories to weed out the 
artifacts not part of the main application (e.g., test suites, 
accessory library code, or informal documentation). When 
a repository contained multiple branches, we selected the 
main branch (trunk in Subversion and default in Mercurial) 
and excluded the others. 

4.2 Analysis tools 

To support analysis of large amounts of program code in
multiple languages, we developed COAT, a "COntract Anal-
ysis Tool". The current implementation of COAT has four
main components: CoatRepo retrieves the complete re-
vision history of projects; CoatEiffel and CoatC# are
two language-specific back-ends that process Eiffel and C#
classes and extract contracts and code into a database;
CoatAnalyze queries the database data supplied by the back-
ends and produces the raw measures discussed in Section 4.3.
Finally, a set of R scripts read the raw data produced by
CoatAnalyze and perform statistical data analysis.

CoatRepo accesses Subversion and Mercurial reposito-
ries, checks out all revisions of a project, and stores them
locally together with other relevant data such as commit
dates, messages, and authors. We used this additional data
to investigate unexpected behavior, such as sudden extreme
changes in project sizes, as we mention in Section 5.

     1     class MEASURE                     -- Revision 2
     2
     3       make do create data end
     4
     5       data: LIST [INTEGER]            -- List of data
     6
     7       add_datum (d: INTEGER)
     8         require not data.has (d)
     9  +              d > 0
    10         do
    11           data.append (d)
    12         ensure not data.is_empty
    13
    14       copy_data (new_list: LIST [INTEGER])
    15         require new_list ≠ Void
    16         do
    17  +        if not new_list.is_empty then
    18             across new_list as x: add_datum (x)
    19
    20       invariant
    21         data ≠ Void
    22  +      not data.is_empty

Figure 1 (right): Class MEASURE in revision 2. Lines added in revision 2 are marked with +.

CoatEiffel parses Eiffel classes, extracts body and spec- 
ification elements, and stores them in a relational database, 
in a form suitable for the subsequent processing. While pars- 
ing technology is commonplace, parsing projects over a life 
span of nearly 20 years (such as EiffelBase) is challenging 
because of changes in the language syntax and semantics. 

A major question for our analysis was how to deal with 
inheritance. Routines and classes inherit contracts as well 
as implementations; when analyzing the specification of a 
routine or a class, should our measures include the inher- 
ited specification? Since we had no preliminary evidence 
to prefer one approach or the other, our tools analyze each 
class twice: once in its flat version and once in its non-flat 
version. The non-flat version of a class is limited to what 
appears in the class text. A flat class, in contrast, explic- 
itly includes all the routines (with their specification) and 
invariants of the ancestor classes. Flattening ignores, how- 
ever, library classes or framework classes that are not part of 
the repository. Reconstructing flat classes in the presence of 
multiple inheritance (supported and frequently used in Eif- 
fel) has to deal with features such as member renaming and 
redefinitions which introduce further complexity, and hence 
requires static analysis of the dependencies among the ab- 
stract syntax trees of different classes. Parsing C# is simpler 
because it only deals with the latest version of the language 
and single inheritance. The C# language has, however, its 
peculiarities that require some special processing to make 
the results comparable with Eiffel's. For example, specifi- 
cations of an interface or abstract class must appear in a 
separate child class including only the contracts; our tool 
merges these "specification classes" with their parent. Sec-
tion 5.6 compares our measures for the flat and non-flat
versions of our projects; the overall conclusion is that the
measures tend to be correlated. This is a useful piece of
information for the continuation of our study: for the mea-
sures we took, both considering in detail and overlooking
inheritance seem to lead to consistent results.

 #  PROJECT                LANGUAGE  # REVISIONS    AGE    # LOC  # DEVELOPERS  DESCRIPTION
 1  AutoTest               Eiffel            306    195   65'625            13  Contract-based random testing tool
 2  EiffelBase             Eiffel           1342   1006   61'922            45  General-purpose data structures library
 3  EiffelProgramAnalysis  Eiffel            208    114   40'750             8  Utility library for analyzing Eiffel programs
 4  GoboKernel             Eiffel            671    747   53'316             8  Library for interoperability between Eiffel compilers
 5  GoboStructure          Eiffel            282    716   21'941             6  Portable data structure library
 6  GoboTime               Eiffel            120    524   10'840             6  Date and time library
 7  GoboUtility            Eiffel            215    716    6'131             7  Library to support design patterns
 8  GoboXML                Eiffel            922    285  163'552             6  XML Library supporting XSL and XPath
 9  Boogie                 C#                766    108   88'284            29  Program verification system
10  CCI                    C#                100    171   20'602             3  Library to support compilers construction
11  Dafny                  C#                326    106   29'700            19  Language and program verifier for functional correctness
12  LabsFramework          C#                 49     30   14'540             1  Library to manage experiments in .NET
13  Quickgraph             C#                380    100   40'820             4  Generic graph data structure library
14  Rxx                    C#                148     68   55'932             2  Library of unofficial reactive LINQ extensions
15  Shweet                 C#                 59      7    2'352             2  Application for messaging in Twitter style
    Total                                  5'894  4'893  676'307           159

Table 1: List of projects used in the study ("age" is in weeks).
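The difference between the flat and non-flat views of a class can be sketched as follows. This is a simplified illustration, not the algorithm implemented in COAT: it assumes that classes are given as plain dictionaries, that a redefined routine simply accumulates the clauses of its ancestors, and it ignores renaming altogether. The class POSITIVE_MEASURE and the function names are hypothetical.

```python
def merge_clauses(target, clauses):
    # Append clauses not already present, keeping their order.
    for clause in clauses:
        if clause not in target:
            target.append(clause)


def flatten(name, classes):
    """Return the flat view of class `name`: own plus inherited contracts."""
    own = classes[name]
    flat = {"invariant": [], "routines": {}}
    sources = []
    for parent in own["parents"]:
        if parent in classes:            # ancestors outside the repository are ignored
            sources.append(flatten(parent, classes))
    sources.append(own)                  # the class text itself comes last
    for source in sources:
        merge_clauses(flat["invariant"], source["invariant"])
        for routine, spec in source["routines"].items():
            entry = flat["routines"].setdefault(routine, {"require": [], "ensure": []})
            merge_clauses(entry["require"], spec["require"])
            merge_clauses(entry["ensure"], spec["ensure"])
    return flat


classes = {
    "MEASURE": {
        "parents": ["ANY"],              # library class, not in the repository
        "invariant": ["data /= Void"],
        "routines": {"add_datum": {"require": ["not data.has (d)"],
                                   "ensure": ["not data.is_empty"]}},
    },
    "POSITIVE_MEASURE": {                # hypothetical descendant
        "parents": ["MEASURE"],
        "invariant": [],
        "routines": {"add_datum": {"require": [], "ensure": ["data.has (d)"]}},
    },
}
print(flatten("POSITIVE_MEASURE", classes))
```

In the non-flat view POSITIVE_MEASURE has no invariant and add_datum has no precondition; in the flat view both are present. In real Eiffel, inherited preconditions are combined by disjunction and postconditions by conjunction, so the clause lists above are a counting device rather than a faithful semantics.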

CoatAnalyze reads the data stored in the database by
CoatEiffel and CoatC# and computes the raw measures
described in Section 4.3. It outputs them to CSV files, which
are finally processed by a set of R scripts that produce tables
with statistics (such as Table 3) and plots (such as those in
Figure 2). The complete set of statistics is available [7, 13]
(see the appendix).

4.3 Measures 

The long list of raw measures produced by CoatAnalyze
includes, for each revision: 

• The number of classes, the number of classes with in- 
variants, the average number of invariant clauses per 
class, and the number of classes modified compared to 
the previous revision; 

• The number of routines (public and private), the num- 
ber of routines with non-empty precondition, with non- 
empty postcondition, and with non-empty specifica- 
tion (that is, precondition, postcondition, or both), 
the average number of pre and postcondition clauses 
per routine, and the number of routines with modified 
body compared to the previous revision. 
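The routine-level measures listed above could be computed along the following lines. The sketch assumes that each routine of a revision is represented as a dictionary with lists of precondition and postcondition clauses, and that averages are taken over all routines of the revision; both the representation and the averaging convention are assumptions, as is the name routine_measures.

```python
def routine_measures(routines):
    """Compute per-revision measures from a list of routine descriptions.

    Each routine is a dict with 'require' and 'ensure' clause lists.
    """
    total = len(routines)
    with_pre = sum(1 for r in routines if r["require"])
    with_post = sum(1 for r in routines if r["ensure"])
    # Non-empty specification: precondition, postcondition, or both.
    with_spec = sum(1 for r in routines if r["require"] or r["ensure"])
    pre_clauses = sum(len(r["require"]) for r in routines)
    post_clauses = sum(len(r["ensure"]) for r in routines)
    return {
        "routines": total,
        "with_precondition": with_pre,
        "with_postcondition": with_post,
        "with_specification": with_spec,
        "pct_with_specification": 100.0 * with_spec / total if total else 0.0,
        "avg_pre_clauses": pre_clauses / total if total else 0.0,
        "avg_post_clauses": post_clauses / total if total else 0.0,
    }


# Routines of class MEASURE in revision 2 (Figure 1).
measure_rev2 = [
    {"name": "make", "require": [], "ensure": []},
    {"name": "add_datum", "require": ["not data.has (d)", "d > 0"],
     "ensure": ["not data.is_empty"]},
    {"name": "copy_data", "require": ["new_list /= Void"], "ensure": []},
]
print(routine_measures(measure_rev2))
```

For this single class, two of the three routines carry some specification, and the routines have on average one precondition clause each.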

Measuring precisely the strength of a specification (which
refers to how constraining it is) is hardly possible as it re-
quires detailed knowledge of a class' semantics and estab-
lishing undecidable properties in general (it is tantamount
to deciding entailment for a first-order logic theory). In
our study, we count the number of specification clauses (el-
ements anded, normally on different lines) as a proxy for
specification strength. The number of clauses is a mea-
sure of size that is interesting in its own right. Then, if
a (non-trivial, i.e., not identically true) clause is added to a
specification element without changing its other clauses, we
certainly have a strengthening; and, conversely, a weaken-
ing when we remove a clause. If some clauses are changed,
just counting the clauses may measure strength incorrectly.
We have evidence, however, that the error introduced by
measuring strengthening in this way is small. We manually
inspected 174 changes randomly chosen, and found 6 mis-
classifications (e.g., strengthening reported as weakening).
Following [28, Eq. 5], this gives a 95% confidence interval
of [1%, 7%]: with 95% probability, the errors introduced by
our estimate involve no more than 7% of the changes.
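The clause-counting proxy can be sketched as a small classifier that also records when its verdict is certain. This is an illustration under assumptions, not the procedure implemented in CoatAnalyze: clauses are compared as plain strings, trivial clauses are not detected, and the name classify_change is hypothetical.

```python
def classify_change(old_clauses, new_clauses):
    """Return (label, certain) for a change to one specification element.

    The verdict is certain only when clauses are purely added or purely
    removed; otherwise it relies on the clause count alone and may be wrong.
    """
    old, new = set(old_clauses), set(new_clauses)
    if old == new:
        return "unchanged", True
    if old < new:                        # clauses only added
        return "strengthening", True
    if new < old:                        # clauses only removed
        return "weakening", True
    # Some clauses were rewritten: fall back on the clause count.
    if len(new) > len(old):
        return "strengthening", False
    if len(new) < len(old):
        return "weakening", False
    return "changed", False


# Precondition of add_datum from revision 1 to revision 2 (Figure 1).
print(classify_change(["not data.has (d)"], ["not data.has (d)", "d > 0"]))
# A rewritten clause: the count says "strengthening", but this is only a guess.
print(classify_change(["d > 0"], ["d >= 0", "d < 100"]))
```

The second call shows the kind of case that manual inspection has to settle: the new precondition has more clauses, yet it neither implies nor is implied by the old one.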

The following table shows more data about the manual
validation. For each language, as well as overall in the last
line, it shows the number of errors found by manual sampling
in each of three categories, and the total in the last column.
Next to each number is the corresponding 95% confidence
interval. The column categories are: changes that are syn-
tactic but do not change the semantics (column semantic);
contracts that have not changed but are reported as changed
(typically because of some spurious character not parsed
away, column change); and incorrectly reported weaken-
ings or strengthenings (column strength). Only the errors
about "strength" affect the results of our study (last para-
graph of Section 5.5). The "change" errors might slightly
decrease the confidence in the results comparing changes to
specifications and changes to implementations (rest of Section 5.5),
since the cases of changed specification are probably slightly
overestimated; this can only strengthen the change analysis.
Finally, whether the "semantic" changes are really classifica-
tion errors largely depends on what we are using our measures
for: refactorings such as renaming an attribute normally
require changes to implementations and contracts alike, even
if the net result is no change to the program semantics and
its specification. Anyway, measuring syntactic changes, as
we did in the paper, yields meaningful results.



LANGUAGE  INSPECTED  SEMANTIC      CHANGE        STRENGTH
Eiffel    106        6 [2%, 11%]   2 [1%, 6%]    2 [1%, 6%]
C#        68         0 [0%, 4%]    3 [1%, 11%]   4 [2%, 13%]
All       174        6 [1%, 7%]    5 [1%, 6%]    6 [1%, 7%]
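As a rough cross-check of the confidence intervals in the table, the sketch below computes a Wilson score interval for a binomial proportion. This is an assumption made for illustration: it is not necessarily the formula of [28, Eq. 5], so its bounds may differ by about a percentage point from the ones reported. The function name wilson_interval is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(errors, sample_size, confidence=0.95):
    """Wilson score interval for the proportion errors / sample_size."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    p = errors / sample_size
    denominator = 1 + z * z / sample_size
    center = (p + z * z / (2 * sample_size)) / denominator
    half_width = z * sqrt(p * (1 - p) / sample_size
                          + z * z / (4 * sample_size ** 2)) / denominator
    return center - half_width, center + half_width


# 6 misclassifications among 174 manually inspected changes.
low, high = wilson_interval(6, 174)
print(f"{100 * low:.1f}% to {100 * high:.1f}%")
```

For 6 errors in 174 inspected changes the upper bound stays close to 7%, in line with the claim that misclassifications affect only a small fraction of the changes.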



5. HOW SPECIFICATIONS CHANGE 

This section presents the main findings of our study re- 
garding what kinds of specifications programmers write and 
how they change the specifications as they change the sys- 
tem. We organize the results according to the following main 
questions, addressed in each of the following subsections. 

1. Do projects make a significant usage of contracts, and 
how does the usage evolve over time? 

2. How does the usage of contracts change with projects 
growing or shrinking in size? 

3. What kinds of contract elements are used more often? 

4. What is the typical size and strength of contracts, and 
how does it change over time? 
5. Do implementations change more often than their con-
tracts?

6. What is the role of inheritance on the way contracts 
change over time? 

Table 3 shows much of the quantitative data we discuss in
the section for each project; Figure 2 displays the plots of the
data about number and percentage of routines with specifi-
cation for a subset of the projects.

5.1 Writing contracts 

In the majority of projects in our study, developers de- 
voted a considerable part of their programming effort to 
writing specifications for their code. While we specifically 
target projects with some specification (and ignore the ma- 
jority of software that does not use contracts), we observe 
that most of the projects achieve significant percentages of 
routines or classes with specification. In 6 of the 15 analyzed 
projects, on average 50% or more of the routines have some 
specification (pre or postcondition); in 10 projects, 35% or 
more of the routines have specification; and only 3 projects 
have small percentages of specified routines (16% or less). 
Usage of class invariants is more varied but still consistent: 
in 8 projects, 33% or more of the classes have an invariant; 
in 5 projects, 12% or less of the classes have an invariant. 
The standard deviation of these percentages is often small 
compared to the average value over all revisions: in 10 of the 
15 projects, the latter is at least five times larger, suggesting 
that deviations from the average are normally small. Sec-
tion 5.2 gives a quantitative confirmation of this hint about
the stability of specification amount over time.

Consider, for example, the EiffelBase project — a large col- 
lection of generic library classes used in most Eiffel projects. 
After an initial fast growing phase, corresponding to a still 
incipient design that is taking shape, the percentages of rou- 
tines and classes with specification stabilize around the me- 
dian values with some fluctuations that — while still signif- 
icant, as we comment on later — do not affect the overall 
trend or the average percentage of specified elements. This 
two-phase development (initial mutability followed by sta- 
bility) is present in several other projects of comparable size, 
and is sometimes extreme, such as for Boogie, where there 
is a widely varying initial phase, followed by a very sta-
ble one where the percentage of elements with specification
is practically constant around 30%. Analyzing the commit
logs around the revisions of greater instability showed that 
wild variations in the specified elements coincide with major 
reengineering efforts. For Boogie, the initial project phase 
coincides with the porting of a parent project written in 
Spec# (a dialect of C#), and includes frequent alternations 
of adding and removing code from the repository; after this 
phase, the percentage of routines and classes with specifica- 
tion stabilizes to a value close to the median. 

There are a few outlier projects where the percentage 
of elements with specification is small, not kept consistent 
throughout the project's life, or both. Quickgraph, for ex- 
ample, never has more than 4% of classes with an invariant 
or routines with a postcondition, and its percentage of rou- 
tines with precondition varies twice between 12% and 21% 
in about 100 revisions. 

Public vs. private routines. The data analysis focuses
on contracts of public routines. To determine whether trends
are different for private routines, we visually inspected the
plots and computed the correlation coefficient^3 r for the evo-
lution of the percentages of specified public routines against
those of private routines. The results suggest partitioning the
projects into three categories. For the 5 projects in the first
category (AutoTest, EiffelBase, Boogie, CCI, and Dafny),
the correlation is positive and high (r > 0.64). The 2
projects in the second category (GoboStructure and Labs)
have negative and substantial correlations (r < -0.47). The
remaining 7 projects belong to the third category, charac-
terized by correlations small in absolute value, positive or
negative. This partitioning probably corresponds to differ-
ent approaches to interface design and encapsulation: for
projects in the first category, public and private routines
always receive the same amount of specification through-
out the project's life; projects in the second category show
negative correlations that correspond to changes to the vis-
ibility status of a significant fraction of the routines; visual
inspection of projects in the third category still suggests pos-
itive correlations between public and private routines with
specification, but the occasional redesign upheaval reduces
the overall value of r or the confidence level. In fact, the
confidence level is typically smaller for projects in the third
category; and it is not significant (p = 0.418) only for Eif-
felProgramAnalysis, which also belongs to the third cate-
gory. Interestingly, projects with small correlations tend
to be smaller in size with fewer routines and classes; con-
versely, large projects may require a stricter discipline in
defining and specifying the interface and its relations with
the private parts, and have to adopt approaches consistent
throughout their lives.
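For readers unfamiliar with Kendall's rank correlation, the following sketch computes it for two series, such as the per-revision percentages of specified public and private routines. It implements the tie-corrected variant usually called tau-b with a simple quadratic loop; which variant the study's R scripts use is not stated here, so this choice is an assumption, and the sample series are invented for illustration.

```python
from math import sqrt


def kendall_tau_b(xs, ys):
    """Kendall's rank correlation (tau-b) between two equally long series."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue                 # tied in both series
            elif dx == 0:
                ties_x += 1              # tied only in xs
            elif dy == 0:
                ties_y += 1              # tied only in ys
            elif dx * dy > 0:
                concordant += 1          # the pair is ordered the same way
            else:
                discordant += 1          # the pair is ordered oppositely
    denominator = sqrt((concordant + discordant + ties_x)
                       * (concordant + discordant + ties_y))
    return (concordant - discordant) / denominator if denominator else 0.0


# Invented percentages of specified public and private routines per revision.
public_pct = [40.0, 42.0, 45.0, 44.0, 47.0]
private_pct = [20.0, 21.0, 25.0, 23.0, 26.0]
print(kendall_tau_b(public_pct, private_pct))
```

A rank-based coefficient depends only on whether the two series move in the same direction between pairs of revisions, not on the size of the moves, which suits percentages that change in occasional jumps.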

5.2 Contracts and project size 

In Section 5.1, we observed that the percentage of specified
routines and classes is fairly stable over time, especially for 
large projects in their maturity. We analyzed the correlation 
between measures of elements with specification and project 
size, and corroborated the found correlations with visual 
inspection of the graphs. 

The correlation between the number of routines or classes 
with some specification and the total number of routines or 
classes (with or without specification) is consistently strong 
and highly significant. For routines, 9 of the projects have an 
almost perfect correlation with r > 0.9 and p ~ 0; even the 2 
projects with the weakest correlation (Labs and Quickgraph) 
achieve r = 0.48 and p < 10~^ . The outlook for classes is 
quite similar; the outlier is project Boogie with a smaller 
r = 0.28 for the correlation between number of classes with
invariants and number of all classes, but visual inspection 
still suggests that a sizable correlation exists. In all, the 
absolute number of elements with specification is normally 
synchronized with the overall size of a project, confirming 
the suggestion of Section 5.1 that the percentage of routines
and classes with specification is stable over time. 

Having established that specification and project size have 
overall similar trends, we can look into finer-grained varia- 
tions of specifications over time. To estimate the relative 
effort of writing specifications, we measured the correlation 
between percentage of specified routines or classes and num-
ber of all routines or all classes.

A first large group of projects, about half of the total
whether we look at routines or classes, shows weak or neg-
ligible correlations (-0.3 < r < 0.3). In this majority of
projects, the relative effort of writing and maintaining spec-
ifications evolves largely independently of the project size.
Given that the overall trend is towards stable percentages,
the high variance is normally concentrated in initial stages
of the projects when there are few routines or classes in
the system and changes can be momentous. GoboKernel is
a specimen of these cases: the percentage of routines with
postconditions varies wildly in the first 100 revisions when
the system is still small and the developers are exploring
different design choices and styles (in particular, they intro-
duced their own variants of some system base classes).

^3 All correlation measures in the paper deploy Kendall's rank
correlation coefficient r.

Another group of 3 projects (AutoTest, Boogie, and Dafny)
shows strong negative correlations (τ < -0.75) both between
percentage of specified routines and number of routines and
between percentage of specified classes and number of classes.
The usual cross-inspection of plots and commit logs
points to two independent phenomena that account
for the negative correlations. The first is the presence of 
large merges of project branches into the main branch; these 
give rise to strong irregularities in the absolute and rela- 
tive amount of specification used, and may reverse or in- 
troduce new specification styles and policies that affect the 
overall trends. AutoTest epitomizes this phenomenon, with 
its history clearly partitioned into two parts separated by 
a large merge at revision 150. Before the merge, the sys- 
tem is smaller with high percentages of routines and classes 
with specification; with the merge, the system grows mani- 
fold and continues growing afterward, while the percentage 
of elements with specification decreases abruptly and then 
(mostly for class invariants) continues decreasing. The sec- 
ond phenomenon that may account for negative correlations 
between percentage of specified elements and measures of 
project size is a sort of "specification fatigue" that kicks in 
as a project becomes mature and quite large. At that point, 
there might be diminishing returns for supplying more speci- 
fication, and so the percentage of elements with specification 
gracefully decreases while the project grows in size. The
fatigue is, however, of small magnitude and may just be a
sign of maturity where a solid initial design with plenty of 
specification elements pays off in the long run to the point 
that less relative investment is sufficient to maintain a stable 
level of maintainability and quality. 

The remaining projects have significant positive correlations
(τ > 0.5) between either percentage of specified routines
and number of routines or between percentage of specified
classes and number of classes, but not both. In these
special cases, it looks as if the fraction of programming ef- 
fort devoted to writing specification tends to increase with 
the absolute size of the system: when the system grows, 
proportionally more routines or classes get a specification. 
However, visual inspection suggests that, in all cases, the 
trend is ephemeral or contingent on transient phases where 
the project size changes significantly in little time. As the 
projects mature and their sizes stabilize, the other two trends 
(no correlation or negative correlation) emerge in all cases. 

5.3 Kinds of contracts 

Do programmers prefer preconditions? The normal intu- 
ition is that preconditions are simpler to write than post- 
conditions (and, for that matter, class invariants), and pro- 
grammers have immediate benefits in writing preconditions 
as opposed to postconditions: a routine's precondition
defines the valid input, and hence the stronger it is the fewer
cases the routine's body has to deal with.

Contrary to this common assumption, the data in our 
study is not consistently lopsided towards preconditions. 2 
projects show no difference in the median percentages of rou- 
tines with precondition and with postcondition. 6 projects 
do have, on average, more routines with precondition than 
routines with postcondition, but the difference in percentage 
is less than 10% in 3 of those projects, and as high as 39% 
only in one project (Dafny). The remaining 7 projects even 
have more routines with postcondition than routines with 
precondition, although the difference is small (less than 5%) 
in 4 projects, and as high as 30% only in GoboTime. 

On the other hand, the percentage of routines with speci- 
fication (precondition, postcondition, or both) is higher than 
either one in all projects but CCI and Shweet (which have 
little specification anyway). Thus, the routines of most 
projects can be partitioned in three groups of comparable 
size: routines with only precondition, routines with only 
postcondition, and routines with both. Many exogenous 
causes may concur to determine the ultimate reasons behind 
picking one kind of contract element over another, such as 
the project domain and the different usage of different spec- 
ification elements. Our data is, however, consistent with the 
notion that programmers choose which specification to write 
according to context and routine semantics, not based on a 
priori preferences. 
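
The partition into three groups can be illustrated with a small sketch. The function name partition_routines and the sample flags are hypothetical; the point is only that the percentage of routines with any specification is the sum of the three groups, and hence exceeds both the percentage with precondition and the percentage with postcondition whenever all three groups are non-empty.

```python
from collections import Counter


def partition_routines(routines):
    """Percentages of routines by kind of specification.

    routines: list of (has_precondition, has_postcondition) boolean pairs.
    """
    counts = Counter()
    for has_pre, has_post in routines:
        if has_pre and has_post:
            counts["both"] += 1
        elif has_pre:
            counts["pre_only"] += 1
        elif has_post:
            counts["post_only"] += 1
        else:
            counts["none"] += 1
    total = len(routines)
    kinds = ("pre_only", "post_only", "both", "none")
    return {kind: 100.0 * counts[kind] / total for kind in kinds}


# Made-up project with 10 routines.
sample = (
    [(True, False)] * 2 + [(False, True)] * 2 + [(True, True)] * 2 + [(False, False)] * 4
)
shares = partition_routines(sample)
with_pre = shares["pre_only"] + shares["both"]
with_post = shares["post_only"] + shares["both"]
with_spec = shares["pre_only"] + shares["post_only"] + shares["both"]
```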

A closer look at the projects where the difference between 
percentages of routines with precondition and with postcon- 
dition is significant (9% or higher) reveals another interest- 
ing pattern. All 4 projects that favor preconditions are C# 
projects: Dafny, Labs, Quickgraph, and Shweet; conversely, 
all 3 projects that favor postconditions are Eiffel projects: 
AutoTest, GoboKernel, and GoboTime. A possible expla- 
nation for this division involves the longer time that Eiffel 
has supported contracts and the principal role attributed to 
Design by Contract within the Eiffel community. C# pro- 
grammers, then, are more likely to pragmatically go for the 
immediate tangible benefits brought by preconditions as op- 
posed to postconditions; Eiffel programmers can be more 
zealous and use contracts thoroughly also for design before 
implementation. 

Class invariants. Class invariants have a somewhat different
status than pre or postconditions. Since class invariants
must hold between consecutive routine calls (Section 2),
they play the role both of preconditions and of postconditions,
and hence they belong to a semantically incomparable
category. The percentage of classes with invariant follows
similar trends as pre and postconditions in most projects in
our study. Only 3 projects stick out because they have 4%
or less of classes with invariant, but otherwise make a
significant usage of other specification elements: Quickgraph,
EiffelProgramAnalysis, and Shweet.* Compared to the others,
Shweet has a short history and EiffelProgramAnalysis
involves students as main developers rather than professionals.
Given that the semantics of class invariants is less
straightforward than that of pre and postconditions, and can become
quite intricate for complex programs [2], this might be a
factor explaining the different status of class invariants in
these projects. A specific design style is also likely to
influence the usage of class invariants, as we further comment on
in Section 5.4.

*CCI has 4% of classes with invariants, but it also has only
3% of routines with precondition and no postconditions.

Kinds of constructs. An additional classification of
contracts is according to the constructs they use. We
gathered data about constructs of three types: expressions
involving checks that a reference is Void (Eiffel) or null (C#);
some form of finite quantification (across in Eiffel, and other
constructs for ∀/∃ over containers in both languages); and
old expressions (used in postconditions to refer to values
in the pre-state). Void checks are by far the most used:
in Eiffel, 36%-93% of preconditions, 7%-62% of
postconditions, and 14%-86% of class invariants include a Void
check; in C#, virtually all preconditions, postconditions,
and class invariants include a null check (only exception:
CCI's postconditions). Void checks are simple to write,
and hence cost-effective, which explains their wide usage;
this may change in the future, with the increasing adoption
of static analyses which supersede Void checks [31][10]. At
the other extreme, quantifications are very rarely used,
except for AutoTest and some C# projects (Boogie, Dafny,
and Quickgraph) where 1%-10% of class invariants deploy
them. This may also change in the future, thanks to the
progress in inferring complex contracts [20][44][43], and
in methodological support [40]. The usage of old is more
varied: C# postconditions rarely use it, whereas it features
in as many as 39% of Eiffel postconditions. Using old may
depend on the design style; for example, if most routines
are side-effect free and return a value that is a function
solely of the input arguments, there is no need to use old.
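
The three kinds of constructs can be illustrated with a short sketch. Python has no native contracts, so plain assertions stand in for them here; the routine append_item and its checks are hypothetical and only mimic what a Void/null check, a finite quantification, and an old expression look like.

```python
def append_item(items, item):
    """Append item to items, emulating contracts with assertions."""
    # Preconditions: Void/null checks, the most frequently used construct.
    assert items is not None, "precondition: items must not be None"
    assert item is not None, "precondition: item must not be None"

    # Snapshot of the pre-state, playing the role of an `old` expression.
    old_count = len(items)

    items.append(item)

    # Postcondition referring to the pre-state (like `count = old count + 1`).
    assert len(items) == old_count + 1, "postcondition: exactly one more element"
    # Postcondition with finite quantification over a container.
    assert all(element is not None for element in items), "no None elements"


values = [1, 2]
append_item(values, 3)
```

Note that Python skips assert statements when run with the -O option, which loosely corresponds to disabling runtime contract checking in a production build.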

5.4 Contract size and strength 

The data about specification size (and strength) partly 
vindicates the intuition that preconditions are more used. 
While Section 5.3 showed that routines are not more likely to
have preconditions than postconditions, preconditions have 
more clauses on average than postconditions in all projects 
but GoboTime. The difference in favor of preconditions is 
larger than 0.5 clauses in 9 projects, and larger than 1 clause 
in 3 projects. CCI never deploys postconditions, and hence 
the difference between pre and postcondition clauses is
immaterial. GoboTime is a remarkable outlier: not only do
twice as many of its routines have a postcondition as have a
precondition, but its average postcondition has 0.66 more
clauses than its average precondition. We found no simple
explanation for this exception, but it is certainly the result 
of deliberate design choices. 

The following two facts corroborate the idea that pro- 
grammers tend to do a better job with preconditions than 
with postconditions — even if they have no general preference 
for one or another. First, the default "trivial" precondition 
true is a perfectly reasonable precondition for routines that 
compute total functions (defined for every value of the
input); a trivial postcondition is, in contrast, never
satisfactory. Second, in general, "strong" postconditions are more
complex than "strong" preconditions [10] since they have to
describe more complex relations.

Class invariants are not directly comparable to pre and 
postconditions, and their usage largely depends on the de- 
sign style. Class invariants apply to all routines and at- 
tributes of a class, and hence they may be used extensively 
and involve many clauses; conversely, they can also be re- 
placed by pre and postconditions in most cases, in which 
case they need not be complex or present at all [36]. In the
majority of projects (11 out of 15), however, class invariants
have more clauses on average than pre and postconditions. 
We might impute this difference to the traditional design 
principles for object-oriented contract-based programming, 
which attribute a significant role to class invariants 
as the preferred way to define valid object state. 

Section 5.1 observed the prevailing stability over time of
routines with specification. Visual inspection and the values 
of standard deviation point to a qualitatively similar trend 
for specification size, measured in number of clauses. In the 
first revisions of a project, it is common to have more varied 
behavior, corresponding to the system design being defined; 
but the average strength of specifications typically reaches 
a plateau, or varies quite slowly, in mature phases. 

Project Labs is somewhat of an outlier, where the evolu- 
tion of specification strength over time has a rugged behav- 
ior. Its average number of class invariant clauses has a step 
at about revision 29, which corresponds to a merge, when it 
suddenly grows from 1.8 to 2.4 clauses per class. During the 
few following revisions, however, this figure drops quickly 
until it reaches a value only slightly higher than what it was 
before revision 29. A reasonable interpretation of what hap- 
pened is that the merge mixed classes developed indepen- 
dently with different programming styles (and, in particular, 
different attitudes to the usage of class invariants). Shortly 
after the merge, the developers refactored the new compo- 
nents to make them comply with the overall style, which is 
characterized by a certain average invariant strength. 

One final, qualitative, piece of data about specification 
strength is that in a few projects there seems to be a mod- 
erate increase in the strength of postconditions towards the 
latest revisions of the project. If this is a real phenomenon, 
it may show that, as programmers become fluent in writing 
specification, they are confident enough to go for the more 
complex postconditions, reversing their initial (moderate) 
focus on preconditions. This observation is however not ap- 
plicable to any of the largest and most mature projects we 
analyzed (e.g., EiffelBase, Boogie, Dafny). 

5.5 Implementation vs. specification changes 

A phenomenon commonly attributed to specifications is 
that they are not updated to reflect changes in the
implementation: even when programmers deploy specifications
extensively in the initial phases of a project, the latter be- 
come obsolete and, eventually, useless. Is there evidence of 
such a phenomenon in the data of our projects? 

We first have to remember a peculiarity of contracts as 
opposed to other forms of specification. Contracts are ex- 
ecutable; normally, they are checked at runtime during de- 
bugging and testing sessions (and possibly also in production 
releases, if the overhead is acceptable, to allow for better er- 
ror reporting from final users). Therefore, contracts cannot 
become grossly misaligned with the implementation: incon- 
sistencies quickly generate runtime errors, which can only 
be fixed by reconciling implementations with their
specifications. By and large, the fact that a significant percentage
of routines and classes have contracts (Section 5.1) implies
that most of them are correct — if incomplete — specifications 
of routine or class behavior. 

Figure 2: Each row displays 3 graphs about the same project. The first column plots the percentage of
routines with non-empty precondition (pre in the legend), postcondition (post), and of classes with
non-empty invariant (inv). The thin grey line not appearing in the legend plots the evolution of number of
routines in the whole system and is scaled. The second column displays the average number of clauses in
preconditions, postconditions, and class invariants. The third column displays the number of routines that

The next question is then whether contracts change more
often or less often than the implementations they specify.
To answer, we compared two measures in the projects: for
each revision, we count the number of routines with changed
body and changed specification (pre or postcondition) and
compare it to the number of routines with changed body
and unchanged specification. These measures aggregated
over all revisions determine a pair of values (c_P, u_P) for each
project P: c_P characterizes the frequency of changes to
implementations that also originated a change in the contracts,
whereas u_P characterizes the frequency of changes to
implementations only. To prevent a few revisions with very
many changes from dominating the aggregate values for a project,
each revision contributes with a binary value to the
aggregate value of a project: 0 if no routine has undergone a
change of that type in that revision, and 1 otherwise. We
performed a Wilcoxon signed-rank test comparing the c_P's
to the u_P's across all projects to determine if the median
difference between the two types of events (changed body
with and without changed specification) is statistically
significant. The results confirm with high statistical
significance (V = 0, p = 6.1 · 10^-5, and large effect size: Cohen's
d > 0.9) that specification changes are quite infrequent
compared to implementation changes for the same routine. A
similar analysis ignoring routines with trivial (empty)
specification leads to the same conclusion also with statistical
significance (V = 14, p = 9.67 · 10~^, and medium effect size
d = 0.424). While omitting the numerical data for brevity,
other analyses along the same lines show that class
invariants change infrequently when attributes are added to or
removed from a class: the most common case is that
invariants remain the same.
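
The binary aggregation can be sketched as follows. This is an illustration under assumptions, not the study's implementation: the function name aggregate_binary and the input layout (one list of per-routine flags per revision) are hypothetical.

```python
def aggregate_binary(revisions):
    """Return the pair (c_P, u_P) for one project.

    revisions: one list per revision, holding a (body_changed, spec_changed)
    pair of booleans for every routine touched in that revision.
    """
    c_p = 0  # revisions where some body change came with a spec change
    u_p = 0  # revisions where some body change left the spec unchanged
    for routine_changes in revisions:
        # Each revision contributes 0 or 1 to each aggregate value,
        # no matter how many routines changed in it.
        c_p += int(any(body and spec for body, spec in routine_changes))
        u_p += int(any(body and not spec for body, spec in routine_changes))
    return c_p, u_p


# Made-up history with four revisions.
history = [
    [(True, True), (True, False), (True, False)],  # counts once for each value
    [(True, False)],                               # body-only change
    [(False, False)],                              # nothing relevant changed
    [(False, True)],                               # spec-only change: ignored
]
c_p, u_p = aggregate_binary(history)
```

In this made-up history the first revision counts once for each value even though two of its routines changed only their body, which is exactly the damping effect the binary contribution is meant to achieve.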

When specifications do change, what happens to their 
strength measured in number of clauses? Another Wilcoxon 
signed-rank test compares the changes to pre and postcon- 
ditions and class invariants that added clauses (suggesting 
strengthening) against those that removed clauses (suggest- 
ing weakening). Since changes to specifications are in gen- 
eral infrequent, the results were not as conclusive as those 
comparing specification and implementation changes. The 
data consistently points towards strengthening being more 
frequent than weakening: V = 19 and p = 0.124 for
precondition changes; V = 10 and p = 0.0141 for postcondition
changes; V = 22.5 and p = 0.063 for invariant changes. The 
effect sizes are, however, smallish: Cohen's d is about 0.32, 
0.39, and 0.28 for preconditions, postconditions, and invari- 
ants. In all, the effect of strengthening being more frequent 
than weakening seems to be real but more data is needed to 
obtain conclusive evidence. 
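
For readers unfamiliar with the V statistic, here is a minimal sketch of how it can be computed for paired samples. It assumes the common convention (used, for instance, by R's wilcox.test) that V is the sum of the ranks of the positive differences, with zero differences discarded and tied absolute differences sharing their average rank; the function name wilcoxon_v and the sample data are hypothetical, and the p-value computation is omitted.

```python
def wilcoxon_v(first, second):
    """Sum of the ranks of the positive paired differences first - second."""
    diffs = [a - b for a, b in zip(first, second) if a != b]  # drop zeros
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        # Find the run of equal absolute differences beginning at `start`.
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1
    return sum(rank for diff, rank in zip(diffs, ranks) if diff > 0)


# Made-up per-project counts: clause additions vs. clause removals.
added = [5, 1, 4]
removed = [1, 2, 2]
v = wilcoxon_v(added, removed)
```

Under this convention V = 0 means that the first measure is never larger than the second in any project, and half-integer values such as 22.5 arise from ties among the absolute differences.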

5.6 Inheritance and contracts 

Inheritance is a staple in object-oriented programming, 
and involves contracts as well as implementations; we now 
evaluate its effects on the findings previously discussed. 

We visually inspected the plots and computed correlation
coefficients for the percentages and average strength of
specified elements in the flat and non-flat versions of the classes.
In the overwhelming majority of cases, the correlations are
high and statistically significant: 10 projects have τ > 0.54
and p < 10~^ for the percentage of routines with
specification; 12 projects have τ > 0.66 and p ≈ 0 for the
percentage of classes with invariant; 8 projects have τ > 0.58
and p < 10~^^ for the average precondition and
postcondition strength (and 4 more projects have τ > 0.38 and
visually evident correlations); and 11 projects have τ > 0.45
and p ≈ 0 for the average invariant strength. The first-order
conclusion is that, in most cases, ignoring the inherited
specification does not preclude understanding qualitative trends.

What about the remaining projects, which have small or
insignificant correlations for some of the measures in the
flat and non-flat versions? Visual inspection often confirms
the absence of significant correlations, in that the measures
evolve along manifestly different shapes in the flat or
non-flat versions; the divergence in trends is typically apparent
in the revisions where the system size changes significantly,
where the overall design (and the inheritance hierarchy) is
most likely to change. To see if these visible differences
invalidate some of the findings discussed so far, we reviewed
the findings against the data for flat classes. The big pic- 
ture was not affected: considering inheritance may affect the 
measures and offset or drift some trends, but the new mea- 
sures are still consistent with the same conclusions drawn 
from the data for non-flat classes. We now discuss the vari- 
ous questions for flat classes in more detail. 

Usage (and kinds) of contracts is qualitatively similar
for the flat and non-flat classes. Of course, the number of
elements with specification tends to be larger in flat classes
simply because specifications are inherited. However, the
relative magnitude of measures such as the average,
minimum, maximum, and standard deviation of routines and
classes with specification is quite similar for flat and
non-flat. Similar observations hold concerning measures of
contract strength and their evolution.
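
The effect of flattening on the raw measures can be illustrated with a toy model. The sketch below is a simplification under stated assumptions: it handles single inheritance only, treats a routine as specified if it declares or inherits any contract, and uses made-up class and routine names; it does not reproduce how Coat flattens classes.

```python
# Made-up hierarchy: routine name -> whether the class itself declares a contract.
CLASSES = {
    "PARENT": {"parent": None, "routines": {"put": True, "item": True}},
    "CHILD": {"parent": "PARENT", "routines": {"extend": False}},
}


def flat_routines(name, classes):
    """Routines of the flat version of a class, including inherited ones."""
    declared = classes[name]
    merged = {}
    if declared["parent"] is not None:
        merged.update(flat_routines(declared["parent"], classes))
    for routine, has_spec in declared["routines"].items():
        # A redeclared routine keeps the contract it inherits, if any.
        merged[routine] = has_spec or merged.get(routine, False)
    return merged


def percent_with_spec(routine_maps):
    """Percentage of routines with specification over all given classes."""
    flags = [flag for routines in routine_maps for flag in routines.values()]
    return 100.0 * sum(flags) / len(flags)


non_flat = percent_with_spec(cls["routines"] for cls in CLASSES.values())
flat = percent_with_spec(flat_routines(name, CLASSES) for name in CLASSES)
```

In this toy hierarchy the flat measure is higher than the non-flat one only because CHILD inherits two specified routines, which mirrors the offset described above.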

Project size is correlated to measures of contracts in
similar ways in flat and non-flat classes. The few outliers
are projects that exhibited a positive correlation between
percentage of specified elements and number of elements in
the non-flat versions (e.g., τ = 0.66 for EiffelBase); the
correlation vanishes in the flat versions (e.g., τ = -0.08 for
EiffelBase). As we discussed at the end of Section 5.2, the
positive correlations of these outliers were unusual and
possibly ephemeral; the fact that correlations dilute away when
we consider the inheritance hierarchy reinforces the idea that
the positive correlation trends of Section 5.2 are exceptional
and largely inconsequential.

Change analyses are virtually identical in flat and
non-flat classes; this is unsurprising since the analyses (discussed
in Section 5.5) target binary differences between a version
and the next one, so that the measures gobble up the offset
introduced by flattening.

Using other "reasonable" aggregation functions (including
exact counting) leads to qualitatively similar results.

A final observation is that differences between measures 
with flat and non-flat classes tend to be smaller in the C# 
projects, as opposed to the Eiffel projects. This reveals that 
multiple inheritance, available in Eiffel but not in C#, may 
contribute to magnify differences between measures of flat 
and non-flat classes.



6. THREATS TO VALIDITY 

Construct validity. The raw measures taken by Coat
expose two potential threats to construct validity. First,
using the number of clauses as a proxy for the strength of a
specification may produce imprecise measures; Section 4.3,
however, estimated the imprecision and showed it is limited,
and hence an acceptable trade-off in most cases (also given
that computing strength exactly is infeasible). Besides, the
number of clauses is still a valuable size/complexity
measure in its own right (Section 5.4). Second, the flattening
introduced to study the effect of inheritance (Section 4.2)
introduces some approximations when dealing with the most
complex usages of multiple inheritance (select clauses) or
of inner classes. We are confident, however, that this
approximation has a negligible impact on our measurements
as these complex usages occur very rarely.

Internal validity. Since we targeted object-oriented
languages where inheritance is used pervasively, it is essential
that the inheritance structure is taken into account in the
measures. We fully addressed this major threat to internal
validity by analyzing all projects twice: in non-flat and flat
version (Section 5.6). A different threat originates from
Coat failing to parse a few files in some revisions, due to
the presence of invalid and outdated language syntax
constructs. The impact of this is certainly negligible: less than
0.0069% of all files could not be parsed. Restricting our
analysis to the main branches and manually discarding
irrelevant content from the repositories pose another potential
threat to internal validity. In all cases, we took great care
to cover the most prominent development path and to
select the main content based on the project configuration files
written by the developers, so as to minimize this threat.

External validity. Our study is restricted to two
formalisms for writing contract specifications: Eiffel and C#
with Code Contracts. While other notations for contracts
(e.g., JML) are similar, we did not analyze other types of
formal specification, which may limit the generalizability of
our findings. We also explicitly targeted projects that use
contracts, as opposed to the overwhelming majority that
only includes informal functional documentation or no
documentation at all. This deliberate choice limits the
generalizability of our findings, but it also focuses the study
on understanding how contracts can be seriously used in
practice. (Establishing that most software projects use no
formal specification hardly requires empirical evidence.) In
contrast, the restriction to open-source projects does not
pose a serious threat to external validity in our study,
because several of our projects are mainly maintained by
professional programmers (EiffelBase and Gobo projects) or by
professional researchers in industry (Boogie and Dafny).

7. RELATED WORK 

Section 3 discussed program analysis and other
activities where contracts can be useful. To our knowledge, this
paper is the first quantitative empirical study of
specifications in the form of contracts and their evolution together
with code. There is evidence, however, that other forms of
documentation (for example, comments [Ts], APIs [24], or
tests [is]) evolve with code.

A well-known problem is that specification and imple- 
mentation tend to diverge over time; this is more likely for 
documents such as requirements and architectural designs 
that are typically developed and stored separately from the 
source code. Much research has targeted this problem; spec- 
ification refinement, for instance, can be applied to soft- 
ware revisions T!6^. Along the same lines, some empirical 
studies analyzed how requirements relate to the correspond- 
ing implementations; [22], for example, examines the co- 
evolution of certain aspects of requirements documents with 
change logs and shows that topic-based requirements trace- 
ability can be automatically implemented from the informa- 
tion stored in version control systems. 



The information about the usage of formal specification 
by programmers is largely anecdotal, with the exceptions of 
a few surveys on industrial practices [5][46]. There is, how- 
ever, some evidence of the usefulness of contracts and asser- 
tions. pGl, for example, suggests that increases of assertions 
density and decreases of fault density correlate. [33] reports 
that using assertions may decrease the effort necessary for 
extending existing programs and increase their reliability. In 
addition, there is evidence that developers are more likely 
to use contracts in languages that support them natively [s] . 
As the technology to infer contracts from code reaches high
precision levels, it is natural to compare automatically
inferred and programmer-written contracts; they turn
out to be, in general, different but with significant
overlap [39].

Our Coat tool (Section 4.2) is part of a very large family
of tools [18] that mine software repositories to extract
quantitative data. In particular, it shares some standard
technologies with other tools for source code analysis (e.g., [34]).

8. CONCLUSION 

This paper presented an extensive empirical study of the 
evolution of specifications in the form of contracts. The 
study targeted 15 projects written in Eiffel and C# (using 
Code Contracts) over a total of 94 years of development. 
The main results show that the percentages of routines with
pre or postcondition and of classes with invariants are above
33% for most projects; that these percentages tend to be
stable over time, if we discount special events like merges or
major refactorings; and that specifications change much less
often than implementations, which makes a good case for
stable interfaces over changing implementations.
10. APPENDIX: COMPLETE STATISTICS

This appendix contains several tables with all statistics 
discussed in the paper. In all tables, the few missing data 
about project Shweet are due to the fact that the project 
lacks class invariants, and hence the corresponding statistics 
are immaterial. 

General specification statistics. Table 3 lists
various general statistics about specifications: # of classes, %
of classes with invariant, # of routines, % of routines with
specification (pre or postcondition), % of routines with
precondition, % of routines with postcondition, average
number of clauses in preconditions, average number of clauses
in postconditions, average number of clauses in class
invariants. Table 4 lists the same data but for flat classes.

Table 5 lists other statistics about projects: language,
number of revisions, age in weeks, lines of code, and then
some of the same statistics about classes and routines as in
Table 3. Table 6 lists the same data for flat classes.

Change correlation analysis. Table 7 lists Wilcoxon
signed-rank tests about changes, as described in Section 5.5,
comparing: changing and non-changing specifications of
routines whose body changes; changing and non-changing NE
(i.e., non-empty) specifications of routines whose body
changes; preconditions becoming weaker vs. becoming stronger;
NE preconditions becoming weaker vs. becoming stronger;
postconditions becoming weaker vs. becoming stronger; NE
postconditions becoming weaker vs. becoming stronger; class
invariants becoming weaker vs. becoming stronger; NE class
invariants becoming weaker vs. becoming stronger; class
invariants becoming weaker when attributes are added vs.
not changing in strength; class invariants becoming stronger
when attributes are added vs. not changing in strength;
class invariants becoming weaker when attributes are
removed vs. not changing in strength; class invariants
becoming stronger when attributes are removed vs. not
changing in strength. The statistics are V and p from the
signed-rank test; Δ(μ) is the difference in medians, whose value is
positive iff the median of the first of the two compared
measures is larger; d is Cohen's effect size (m1 - m2)/σ, where
m1, m2 are the means of the two compared measures and
σ is the standard deviation of the whole measured data,
whose value is positive iff the mean of the first of the two
compared measures is larger. The top half of the table
considers non-flat classes, whereas the bottom half considers
flat classes. Table 8 shows the results of the same analysis
but done by summing all changes instead of counting them
with a binary value for each revision (see Section 5.5).
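
The two effect measures defined above can be computed as in the following sketch. Two points are assumptions rather than facts from the paper: that "the whole measured data" means the two samples pooled together, and that the population (rather than sample) standard deviation is used. The function names and the sample values are hypothetical.

```python
import statistics


def median_difference(first, second):
    """Difference in medians; positive iff the first median is larger."""
    return statistics.median(first) - statistics.median(second)


def cohens_d(first, second):
    """Effect size (m1 - m2) / sigma, with sigma taken over the pooled data."""
    pooled = list(first) + list(second)
    sigma = statistics.pstdev(pooled)  # assumption: population std deviation
    return (statistics.mean(first) - statistics.mean(second)) / sigma


# Made-up measures for two kinds of events over three projects.
with_spec_change = [1, 2, 3]
without_spec_change = [3, 4, 5]
delta = median_difference(with_spec_change, without_spec_change)
d = cohens_d(with_spec_change, without_spec_change)
```

Both values are negative for this made-up data because the first measure is the smaller one, matching the sign convention stated above.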

Flat vs. non-flat correlation analysis. Table 9 lists
correlation statistics between the evolution of the following
measures in flat and non-flat classes: % of routines with
specification; % of classes with invariants; average number
of precondition clauses; average number of postcondition
clauses; average number of invariant clauses.

Size correlation analysis. Table 10 lists correlation
statistics between several pairs of measures: % of routines
with specification and total # of routines; # of routines
with specification and total # of routines; % of classes with
invariant and total # of classes; # of classes with
invariant and total # of classes; average number of precondition
clauses and % of routines with precondition; average number
of postcondition clauses and % of routines with
postcondition; average number of invariant clauses and % of classes
with invariants. Table 11 lists the same statistics for flat
classes.

Public vs. non-public correlation analysis. Table 12
lists correlation statistics between: % of public routines with
precondition and % of non-public routines with
precondition; % of public routines with postcondition and % of
non-public routines with postcondition; % of public routines with
specification and % of non-public routines with specification.
Table 13 lists the same statistics for flat classes.

Kinds of constructs. Table 14 lists statistics about
constructs used in contracts: % of preconditions with Void or
null checks; % of preconditions with some form of
quantification; % of postconditions with Void or null checks; % of
postconditions with some form of quantification; % of
postconditions with old; % of class invariants with Void or null
checks; % of class invariants with some form of
quantification.
Table 7: Change analysis with majority for non-flattened (top) and flattened (bottom) classes.

 0.00E+00   6.10E-05  -2.28E+02  -8.26E-01
 3.90E+01   2.52E-01  -8.00E+00  -2.04E-01
 3.20E+01   3.63E-01  -2.00E+00   8.27E-02
 2.05E+01   2.86E-01   0.00E+00  -5.13E-01
 2.00E+01   2.43E-02  -4.00E+00  -4.08E-01
 0.00E+00   0.00E+00   0.00E+00
 1.30E+01   2.51E-02  -4.00E+00  -4.41E-01
 2.05E+01   2.85E-01  -1.00E+00  -2.19E-01
 0.00E+00   1.43E-02  -2.00E+00  -7.10E-01
 0.00E+00   1.41E-02  -2.00E+00  -4.97E-01
 3.50E+00   9.04E-02   0.00E+00  -4.04E-01
 0.00E+00   2.23E-02   0.00E+00  -6.44E-01

 0.00E+00   6.10E-05  -2.28E+02  -8.26E-01
 3.90E+01   2.52E-01  -8.00E+00  -2.04E-01
 3.20E+01   3.63E-01  -2.00E+00   8.27E-02
 2.05E+01   2.86E-01   0.00E+00  -5.13E-01
 2.00E+01   2.43E-02  -4.00E+00  -4.08E-01
 0.00E+00   0.00E+00   0.00E+00
 1.30E+01   2.51E-02  -4.00E+00  -4.41E-01
 2.05E+01   2.85E-01  -1.00E+00  -2.19E-01
 0.00E+00   1.43E-02  -2.00E+00  -7.10E-01
 0.00E+00   1.41E-02  -2.00E+00  -4.97E-01
 3.50E+00   9.04E-02   0.00E+00  -4.04E-01
 0.00E+00   2.23E-02   0.00E+00  -6.44E-01

Table 8: Change analysis for non-flattened (top) and flattened (bottom) classes. 

                       % ROUTINES W/ SPEC  % CLASSES W/ INV    AVG PRE STRENGTH    AVG POST STRENGTH   AVG INV STRENGTH
Project                        p      τ            p      τ            p      τ            p      τ            p      τ
AutoTest                0.00E+00   0.65     0.00E+00   0.98     0.00E+00   0.60     0.00E+00   0.75     1.55E-04   0.16
EiffelBase              0.00E+00   0.17     0.00E+00   0.27     0.00E+00   0.62     0.00E+00   0.38     0.00E+00   0.46
EiffelProgramAnalysis   8.98E-01  -0.01     0.00E+00   0.66     4.75E-12   0.34     1.24E-10   0.31     0.00E+00   0.95
GoboKernel              0.00E+00   0.60     0.00E+00   0.86     0.00E+00   0.89     0.00E+00   0.36     0.00E+00   0.97
GoboStructure           8.88E-01   0.01     1.44E-03   0.13     0.00E+00   0.38     0.00E+00   0.70     0.00E+00   0.45
GoboTime                4.28E-01  -0.05     0.00E+00   0.81     0.00E+00   0.58     0.00E+00   0.89     1.41E-06   0.36
GoboUtility             0.00E+00   0.92     4.19E-02  -0.10     0.00E+00   0.80     0.00E+00   1.00     0.00E+00   0.90
GoboXML                 0.00E+00   0.28     0.00E+00   0.77     0.00E+00   0.78     0.00E+00   0.76     0.00E+00   0.79
Boogie                  0.00E+00   0.91     0.00E+00   1.00     3.65E-08  -0.14     0.00E+00   0.29     0.00E+00   0.71
CCI                     0.00E+00   0.98     0.00E+00   1.00     0.00E+00   0.97     0.00E+00   1.00     0.00E+00   1.00
Dafny                   0.00E+00   0.90     0.00E+00   1.00     0.00E+00   0.45     0.00E+00   0.78     4.52E-06  -0.18
Labs                    9.79E-10   0.64     0.00E+00   1.00     0.00E+00   0.97     0.00E+00   0.94     4.19E-02  -0.22
Quickgraph              0.00E+00   0.99     0.00E+00   1.00     0.00E+00   1.00     0.00E+00   1.00     0.00E+00   1.00
Rxx                     0.00E+00   0.54     0.00E+00   1.00     4.20E-11   0.38     0.00E+00   0.92     0.00E+00   1.00
shweet                  0.00E+00   0.99            -      -     0.00E+00   1.00     2.62E-14   1.00            -      -
Overall                 0.00E+00   0.41     0.00E+00   0.52     0.00E+00   0.79     0.00E+00   0.78     0.00E+00   0.39

Table 9: Correlation of measures in flat and non-flat classes. 

                       % SPEC/# ROUTINES   # SPEC/# ROUTINES   % INV/# CLASSES     # INV/# CLASSES     AVG/% PRE           AVG/% POST          AVG/% INV
Project                        p      τ            p      τ            p      τ            p      τ            p      τ            p      τ            p      τ
AutoTest                1.02E-94  -0.82     0.00E+00   0.93    1.21E-117  -0.93     0.00E+00   0.76     0.00E+00   0.65     0.00E+00   0.43     0.00E+00   0.78
EiffelBase              0.00E+00   0.66     0.00E+00   0.98     1.13E-08   0.11     0.00E+00   0.95     0.00E+00   0.44     0.00E+00   0.26     0.00E+00   0.19
EiffelProgramAnalysis   1.17E-10   0.31     0.00E+00   0.98     8.25E-04  -0.17     0.00E+00   0.85     1.69E-01   0.07     4.94E-01   0.03     5.23E-01  -0.04
GoboKernel              9.98E-04  -0.09     0.00E+00   0.97     0.00E+00   0.35     0.00E+00   0.92     2.71E-05  -0.11     1.42E-01  -0.04     6.90E-11   0.18
GoboStructure           3.04E-05   0.17     0.00E+00   0.53     5.69E-07  -0.21     0.00E+00   0.70     4.23E-04   0.15     0.00E+00   0.52     6.58E-01   0.02
GoboTime                2.16E-04  -0.24     0.00E+00   0.91     6.16E-13   0.51     0.00E+00   0.91     1.55E-05   0.30     7.87E-10   0.41     6.79E-02   0.13
GoboUtility             1.94E-32  -0.56     0.00E+00   0.99     0.00E+00   0.60     0.00E+00   0.95     0.00E+00   0.54     0.00E+00   0.58     2.83E-13   0.36
GoboXML                 5.77E-74  -0.40     0.00E+00   0.97     0.00E+00   0.57     0.00E+00   0.94     3.65E-03   0.07     0.00E+00   0.51     0.00E+00   0.57
Boogie                  0.00E+00  -0.95     0.00E+00   0.56    8.94E-209  -0.77     0.00E+00   0.28     6.75E-32  -0.30     1.85E-05   0.12     0.00E+00   0.57
CCI                     0.00E+00   0.67     0.00E+00   0.76     9.32E-01   0.01     2.90E-12   0.58     2.88E-10   0.48     0.00E+00   0.91     4.31E-04  -0.29
Dafny                  4.77E-100  -0.81     0.00E+00   0.97     1.94E-88  -0.77     0.00E+00   0.65     0.00E+00   0.32     0.00E+00   0.53     6.67E-13   0.29
Labs                    4.64E-01  -0.08     6.07E-06   0.48     5.23E-08   0.58     8.88E-16   0.88     7.32E-14  -0.81     1.46E-03   0.34     1.15E-08   0.62
Quickgraph              1.89E-06   0.17     0.00E+00   0.48     0.00E+00   0.63     0.00E+00   0.83     6.66E-02   0.07     0.00E+00   0.79     0.00E+00   0.54
Rxx                     4.76E-41  -0.76     0.00E+00   0.97     4.32E-03   0.17     0.00E+00   0.96     1.67E-11  -0.38     1.86E-33  -0.70     5.77E-01   0.03
shweet                  1.44E-03   0.32     4.44E-16   0.84            -      -            -      -     0.00E+00   0.87     4.23E-11   0.77            -      -
Overall                 3.20E-40  -0.12     0.00E+00   0.81     0.00E+00   0.18     0.00E+00   0.66     0.00E+00   0.44     0.00E+00   0.12     0.00E+00   0.45

Table 10: Correlation of measures of specification in non-flat classes. 

                       % SPEC/# ROUTINES   # SPEC/# ROUTINES   % INV/# CLASSES     # INV/# CLASSES     AVG/% PRE           AVG/% POST          AVG/% INV
Project                        p      τ            p      τ            p      τ            p      τ            p      τ            p      τ            p      τ
AutoTest                1.28E-70  -0.70     0.00E+00   0.92    3.09E-117  -0.93     0.00E+00   0.88     0.00E+00   0.51     0.00E+00   0.38     1.42E-01  -0.06
EiffelBase              8.85E-06  -0.08     0.00E+00   0.92     9.35E-35  -0.23     0.00E+00   0.94     0.00E+00   0.33     7.64E-24  -0.19     0.00E+00   0.16
EiffelProgramAnalysis   1.51E-14   0.37     0.00E+00   0.98     8.13E-19  -0.44     0.00E+00   0.89     0.00E+00   0.50     4.93E-07   0.25     6.44E-13  -0.39
GoboKernel              0.00E+00   0.29     0.00E+00   0.99     0.00E+00   0.47     0.00E+00   0.92     7.60E-05  -0.10     3.25E-64  -0.45     4.99E-11   0.18
GoboStructure           6.58E-01   0.02     0.00E+00   0.98     3.63E-13   0.30     0.00E+00   0.72     0.00E+00   0.53     8.23E-43  -0.56     5.37E-14   0.31
GoboTime                2.41E-01   0.08     0.00E+00   0.98     2.22E-16   0.58     0.00E+00   0.88     4.86E-05   0.27     1.87E-02   0.16     0.00E+00   0.80
GoboUtility             2.47E-34  -0.58     0.00E+00   0.99     3.70E-04  -0.17     0.00E+00   0.97     8.88E-16   0.39     2.66E-05   0.21     1.36E-03   0.16
GoboXML                 0.00E+00   0.20     0.00E+00   0.98     0.00E+00   0.69     0.00E+00   0.95     4.88E-15   0.17     2.95E-10   0.14     0.00E+00   0.69
Boogie                 8.05E-274  -0.87     0.00E+00   0.89    8.94E-209  -0.77     0.00E+00   0.28     0.00E+00   0.51     4.31E-01   0.02     0.00E+00   0.42
CCI                     0.00E+00   0.64     0.00E+00   0.75     9.32E-01   0.01     2.90E-12   0.58     3.48E-11   0.50     0.00E+00   0.92     4.31E-04  -0.29
Dafny                   9.03E-93  -0.77     0.00E+00   0.97     1.94E-88  -0.77     0.00E+00   0.65     0.00E+00   0.60     0.00E+00   0.51     4.52E-80  -0.73
Labs                    8.77E-06  -0.46     9.08E-03   0.28     5.23E-08   0.58     8.88E-16   0.88     1.48E-16  -0.89     2.58E-03   0.32     1.00E-06  -0.53
Quickgraph              1.92E-06   0.17     0.00E+00   0.48     0.00E+00   0.63     0.00E+00   0.83     7.10E-02   0.06     0.00E+00   0.80     0.00E+00   0.54
Rxx                     3.92E-09  -0.33     0.00E+00   0.97     4.32E-03   0.17     0.00E+00   0.96     3.79E-09  -0.34     9.97E-11  -0.37     5.77E-01   0.03
shweet                  2.08E-03   0.31     4.44E-16   0.84            -      -            -      -     0.00E+00   0.88     4.15E-11   0.78            -      -
Overall                 6.27E-76  -0.16     0.00E+00   0.86     0.00E+00   0.09     0.00E+00   0.63     0.00E+00   0.22     0.00E+00   0.28     0.00E+00   0.55

Table 11: Correlation of measures of specification in flat classes. 

                       % PRE PUBLIC/PRIVATE    % POST PUBLIC/PRIVATE   % SPEC PUBLIC/PRIVATE
Project                        p      τ                p      τ                p      τ
AutoTest               [illegible]
EiffelBase             [illegible]
EiffelProgramAnalysis   2.16E-09   0.29         0.00E+00   0.49         4.18E-01   0.04
GoboKernel              1.17E-05  -0.12         8.31E-01  -0.01         1.48E-39  -0.36
GoboStructure           9.01E-27  -0.44         1.05E-36  -0.52         3.27E-30  -0.47
GoboTime                1.31E-03  -0.22         2.35E-07   0.34         7.48E-05  -0.26
GoboUtility             1.59E-02   0.12         0.00E+00   0.65         3.59E-02   0.10
GoboXML                 1.57E-07  -0.12         2.08E-02   0.05         2.66E-26  -0.24
Boogie                  0.00E+00   0.94         0.00E+00   0.73         0.00E+00   0.94
CCI                     0.00E+00   0.84         0.00E+00   0.80         0.00E+00   0.85
Dafny                   0.00E+00   0.36         0.00E+00   0.86         0.00E+00   0.66
Labs                    1.66E-09  -0.63         4.42E-03   0.30         2.20E-06  -0.50
Quickgraph              1.98E-07   0.19         0.00E+00   0.80         1.45E-07   0.19
Rxx                     7.66E-05   0.23         5.46E-06   0.26         2.95E-03   0.17
shweet                  6.28E-09   0.59         4.78E-05   0.42         3.11E-04   0.37
Overall                 0.00E+00   0.29         0.00E+00   0.43         0.00E+00   0.37

Table 12: Correlation of measures between public and private routines in non-flat classes. 

                       % PRE PUBLIC/PRIVATE    % POST PUBLIC/PRIVATE   % SPEC PUBLIC/PRIVATE
Project                        p      τ                p      τ                p      τ
AutoTest                0.00E+00   0.66         0.00E+00   0.61         0.00E+00   0.76
EiffelBase              0.00E+00   0.41         5.13E-18  -0.16         8.27E-01   0.00
EiffelProgramAnalysis   1.00E-06   0.24         0.00E+00   0.60         2.25E-10   0.31
GoboKernel              7.45E-02   0.05         1.07E-08  -0.15         3.82E-07  -0.14
GoboStructure           3.17E-01  -0.04         0.00E+00   0.54         8.73E-06  -0.18
GoboTime                1.85E-03  -0.21         3.49E-01  -0.06         2.13E-06   0.32
GoboUtility             1.39E-01   0.07         2.84E-04   0.18         1.19E-02   0.12
GoboXML                 0.00E+00   0.37         2.68E-01   0.02         0.00E+00   0.19
Boogie                  0.00E+00   0.68         0.00E+00   0.27         0.00E+00   0.84
CCI                     0.00E+00   0.83         0.00E+00   0.80         0.00E+00   0.83
Dafny                   0.00E+00   0.40         0.00E+00   0.80         0.00E+00   0.66
Labs                    2.89E-11  -0.69         2.37E-02   0.23         3.94E-12  -0.72
Quickgraph              4.07E-08   0.20         0.00E+00   0.81         4.19E-09   0.21
Rxx                     1.01E-04   0.22         0.00E+00   0.47         4.77E-04   0.20
shweet                  9.05E-09   0.59         4.59E-05   0.43         7.61E-04   0.34
Overall                 0.00E+00   0.32         0.00E+00   0.46         0.00E+00   0.39

Table 13: Correlation of measures between public and private routines in flat classes. 

11. NON-FLAT CLASSES 

The following pages show the plots for the 15 projects of 
our study, considering non-flat classes. The dotted lines, 
when present, mark median values of the various quantities. 
The thin continuous lines that do not appear in the 
legend track the total number of routines, classes, or both, 
whose absolute values are scaled to the range determined 
by the main data represented in the graph. When two 
such thin lines are present, the red one tracks the number 
of classes and the aquamarine one tracks the number of 
routines. When only one (red) thin line is present, it tracks the 
number of routines or classes according to the main content 
of the graph. 

12. FLAT CLASSES 

The following pages show the plots for the 15 projects of 
our study, considering flat classes. 